Write a “problem statement” and an introductory paragraph that clearly explain your goals. It should include at least the following info:
● Describe your dataset, why you picked it, and write a small paragraph discussing your goal
with your dataset, what models you can use to analyze it, and why.
The goal of this project is to find the best way to characterize the variety of customers that a wholesale distributor deals with. Since this is a clustering project, I will use K-Means Clustering and Agglomerative Hierarchical Clustering and compare their results.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objs as pgo
import pickle
import warnings
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler, StandardScaler
warnings.filterwarnings("ignore")
%matplotlib inline
myData=pd.read_csv("Wholesale customers data.csv")
myData.head(5)
| | Channel | Region | Fresh | Milk | Grocery | Frozen | Detergents_Paper | Delicassen |
|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 3 | 12669 | 9656 | 7561 | 214 | 2674 | 1338 |
| 1 | 2 | 3 | 7057 | 9810 | 9568 | 1762 | 3293 | 1776 |
| 2 | 2 | 3 | 6353 | 8808 | 7684 | 2405 | 3516 | 7844 |
| 3 | 1 | 3 | 13265 | 1196 | 4221 | 6404 | 507 | 1788 |
| 4 | 2 | 3 | 22615 | 5410 | 7198 | 3915 | 1777 | 5185 |
myData.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 440 entries, 0 to 439
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   Channel           440 non-null    int64
 1   Region            440 non-null    int64
 2   Fresh             440 non-null    int64
 3   Milk              440 non-null    int64
 4   Grocery           440 non-null    int64
 5   Frozen            440 non-null    int64
 6   Detergents_Paper  440 non-null    int64
 7   Delicassen        440 non-null    int64
dtypes: int64(8)
memory usage: 27.6 KB
There are a total of 440 observations and 8 attributes in the dataset. Every feature has 440 non-null entries, which means there are no missing values.
# printing the number of null values in each attribute
myData.isnull().sum()
Channel             0
Region              0
Fresh               0
Milk                0
Grocery             0
Frozen              0
Detergents_Paper    0
Delicassen          0
dtype: int64
myData.describe()
| | Channel | Region | Fresh | Milk | Grocery | Frozen | Detergents_Paper | Delicassen |
|---|---|---|---|---|---|---|---|---|
| count | 440.000000 | 440.000000 | 440.000000 | 440.000000 | 440.000000 | 440.000000 | 440.000000 | 440.000000 |
| mean | 1.322727 | 2.543182 | 12000.297727 | 5796.265909 | 7951.277273 | 3071.931818 | 2881.493182 | 1524.870455 |
| std | 0.468052 | 0.774272 | 12647.328865 | 7380.377175 | 9503.162829 | 4854.673333 | 4767.854448 | 2820.105937 |
| min | 1.000000 | 1.000000 | 3.000000 | 55.000000 | 3.000000 | 25.000000 | 3.000000 | 3.000000 |
| 25% | 1.000000 | 2.000000 | 3127.750000 | 1533.000000 | 2153.000000 | 742.250000 | 256.750000 | 408.250000 |
| 50% | 1.000000 | 3.000000 | 8504.000000 | 3627.000000 | 4755.500000 | 1526.000000 | 816.500000 | 965.500000 |
| 75% | 2.000000 | 3.000000 | 16933.750000 | 7190.250000 | 10655.750000 | 3554.250000 | 3922.000000 | 1820.250000 |
| max | 2.000000 | 3.000000 | 112151.000000 | 73498.000000 | 92780.000000 | 60869.000000 | 40827.000000 | 47943.000000 |
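The describe table shows that for every spending column the maximum is far above the 75th percentile, which suggests strong right skew. A quick way to quantify this is `.skew()`; the sketch below uses a hypothetical mini-sample standing in for the spending columns of `myData` (the real notebook would simply call `myData.skew()`).

```python
import pandas as pd

# Hypothetical mini-sample standing in for two spending columns of myData;
# the real notebook would call myData[['Fresh', 'Milk']].skew() directly.
sample = pd.DataFrame({
    "Fresh": [3127, 8504, 16933, 112151, 4500],
    "Milk":  [1533, 3627, 7190, 73498, 2100],
})

# Positive skew confirms the long right tail suggested by max >> 75th percentile.
skew = sample.skew()
print(skew)
```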
# printing the number of unique values in each attribute
myData.nunique()
Channel               2
Region                3
Fresh               433
Milk                421
Grocery             430
Frozen              426
Detergents_Paper    417
Delicassen          403
dtype: int64
Continuous (spending) features:
1. Fresh
2. Milk
3. Grocery
4. Frozen
5. Detergents_Paper
6. Delicassen

Categorical features:
1. Channel
2. Region
myData['Channel'] = myData['Channel'].map({1:'Horeca', 2:'Retail'})
myData['Region'] = myData['Region'].replace([1, 2, 3], ['Houston', 'New York', 'Los Angeles'])
myData
| | Channel | Region | Fresh | Milk | Grocery | Frozen | Detergents_Paper | Delicassen |
|---|---|---|---|---|---|---|---|---|
| 0 | Retail | Los Angeles | 12669 | 9656 | 7561 | 214 | 2674 | 1338 |
| 1 | Retail | Los Angeles | 7057 | 9810 | 9568 | 1762 | 3293 | 1776 |
| 2 | Retail | Los Angeles | 6353 | 8808 | 7684 | 2405 | 3516 | 7844 |
| 3 | Horeca | Los Angeles | 13265 | 1196 | 4221 | 6404 | 507 | 1788 |
| 4 | Retail | Los Angeles | 22615 | 5410 | 7198 | 3915 | 1777 | 5185 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 435 | Horeca | Los Angeles | 29703 | 12051 | 16027 | 13135 | 182 | 2204 |
| 436 | Horeca | Los Angeles | 39228 | 1431 | 764 | 4510 | 93 | 2346 |
| 437 | Retail | Los Angeles | 14531 | 15488 | 30243 | 437 | 14841 | 1867 |
| 438 | Horeca | Los Angeles | 10290 | 1981 | 2232 | 1038 | 168 | 2125 |
| 439 | Horeca | Los Angeles | 2787 | 1698 | 2510 | 65 | 477 | 52 |
440 rows × 8 columns
fig = px.box(myData, x='Region', y='Milk', color='Region')
fig.show()
The box plot compares Milk spending across the three cities: the median is about 3.6k in Los Angeles, 3.7k in Houston, and 2.3k in New York.
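The medians read off the box plot can be verified numerically with a `groupby`. A minimal sketch, using a hypothetical mini-frame with the same columns as `myData` after the Region relabelling (the real call is `myData.groupby('Region')['Milk'].median()`):

```python
import pandas as pd

# Hypothetical mini-frame standing in for myData after the Region mapping.
df = pd.DataFrame({
    "Region": ["Los Angeles", "Los Angeles", "Houston", "Houston", "New York"],
    "Milk":   [3600, 3700, 3650, 3750, 2300],
})

# Median Milk spend per city, the same statistic the box plot draws.
medians = df.groupby("Region")["Milk"].median()
print(medians)
```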
plt.figure(figsize=(15,10))
heatmap=sns.heatmap(myData.corr(numeric_only=True),annot=True, fmt=".1g", vmin=-1, vmax=1, center=0, cmap="inferno", linewidths=1, linecolor="Black")
heatmap.set_title("Correlation HeatMap between variables")
heatmap.set_xticklabels(heatmap.get_xticklabels(),rotation=90)
plt.show()
The above heatmap shows the correlation between each pair of variables. There are severe collinearity issues between Milk & Detergents_Paper, Grocery & Detergents_Paper, and Milk & Grocery, with correlations above 0.7 and in some cases above 0.9.
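The highly correlated pairs the heatmap reveals can also be extracted programmatically. A sketch, using a small synthetic frame in place of `myData`'s numeric columns (against the full dataset the same code would list the pairs named above):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for three of myData's correlated spending columns.
df = pd.DataFrame({
    "Milk":             [9656, 9810, 8808, 1196, 5410],
    "Grocery":          [7561, 9568, 7684, 4221, 7198],
    "Detergents_Paper": [2674, 3293, 3516, 507, 1777],
})
corr = df.corr(numeric_only=True).abs()
# Keep each pair once by masking the lower triangle and the diagonal.
pairs = corr.where(~np.tril(np.ones(corr.shape, dtype=bool))).stack()
strong = pairs[pairs > 0.7].sort_values(ascending=False)
print(strong)
```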
#Which Region and which Channel seems to spend more?
#Which Region and which Channel seems to spend less?
# sum the six spending columns in one statement (a leading '+' on a new line
# would start a separate expression and silently drop those columns)
myData['Total'] = myData[['Fresh', 'Milk', 'Grocery', 'Frozen',
                          'Detergents_Paper', 'Delicassen']].sum(axis=1)
fig = px.histogram(myData, x="Region", y="Total",
color='Channel', barmode='group',
histfunc='avg',
height=400)
fig.show()
As the diagram above shows (`histfunc='avg'`, so each bar is an average of Total), Retail customers in Los Angeles and Houston have similar average total spending, with New York slightly lower; for Horeca, the averages are almost the same across all three regions.
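The grouped histogram has a tabular counterpart: a pivot table of average total spend per Region/Channel cell. A sketch on a hypothetical mini-frame shaped like `myData` (the real call would be `myData.pivot_table(index='Region', columns='Channel', values='Total', aggfunc='mean')`):

```python
import pandas as pd

spend_cols = ["Fresh", "Milk", "Grocery", "Frozen", "Detergents_Paper", "Delicassen"]
# Hypothetical mini-frame standing in for myData.
df = pd.DataFrame({
    "Region":  ["Los Angeles", "Los Angeles", "Houston", "New York"],
    "Channel": ["Retail", "Horeca", "Retail", "Horeca"],
    "Fresh": [12669, 13265, 7057, 10290],
    "Milk": [9656, 1196, 9810, 1981],
    "Grocery": [7561, 4221, 9568, 2232],
    "Frozen": [214, 6404, 1762, 1038],
    "Detergents_Paper": [2674, 507, 3293, 168],
    "Delicassen": [1338, 1788, 1776, 2125],
})
df["Total"] = df[spend_cols].sum(axis=1)  # all six columns in one step
avg = df.pivot_table(index="Region", columns="Channel", values="Total", aggfunc="mean")
print(avg)
```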
plt.figure(figsize=(20,30))
# sns.distplot was removed from seaborn; histplot(..., kde=True) is the
# current equivalent
plt.subplot(3,3,1)
plt.title('Fresh Distribution')
sns.histplot(myData.Fresh, kde=True)
plt.subplot(3,3,2)
plt.title('Milk Distribution')
sns.histplot(myData.Milk, kde=True)
plt.subplot(3,3,3)
plt.title('Grocery Distribution')
sns.histplot(myData.Grocery, kde=True)
plt.subplot(3,3,4)
plt.title('Frozen Distribution')
sns.histplot(myData.Frozen, kde=True)
plt.subplot(3,3,5)
plt.title('Detergents Paper Distribution')
sns.histplot(myData.Detergents_Paper, kde=True)
plt.subplot(3,3,6)
plt.title('Delicassen Distribution')
sns.histplot(myData.Delicassen, kde=True)
plt.show()
Above we have the distributions for our features; all of them are heavily right-skewed.
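One common remedy for such right-skewed features before distance-based clustering is a log transform. A minimal sketch on synthetic values standing in for a column such as `myData['Fresh']` (this transform is not applied in the project itself, just shown as an option):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a heavily right-skewed spending column.
fresh = pd.Series([3, 3127, 8504, 16933, 112151], dtype=float)
logged = np.log1p(fresh)  # log(1 + x) handles the small minimum safely

# The log transform pulls in the long right tail, shrinking the skew.
print(fresh.skew(), logged.skew())
```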
sns.pairplot(myData,diag_kind = 'kde')
plt.show()
The pair plot shows every pairwise relationship between our features, with kernel density estimates on the diagonal.
fig = px.scatter(myData, x='Fresh', y='Milk', trendline="ols")
fig.show()
This scatter plot, with an OLS trend line, shows the relationship between Fresh and Milk.
fig = px.scatter(myData, x='Frozen', y='Milk', trendline="ols")
fig.show()
This scatter plot, with an OLS trend line, shows the relationship between Frozen and Milk.
1. Yes, of course. Features must be scaled before they are supplied to clustering methods like K-Means. Because Euclidean distance is used to build the clusters, variables on very different scales (for example, heights in meters and weights in kilograms) should be scaled before the distances are computed.
3. Yes, the feature variables are scaled.
Since we are dealing with clustering, an unsupervised task, we do not need to use train_test_split.
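To illustrate the scaling point above: a feature measured in tens of thousands dominates Euclidean distance until it is standardized. A tiny synthetic example (not the project data):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Column 0 spans tens of thousands, column 1 spans single digits; without
# scaling, column 0 would dominate any Euclidean-distance computation.
X = np.array([[112151.0, 1.0],
              [3.0,      2.0],
              [8504.0,   3.0]])
Xs = StandardScaler().fit_transform(X)

# After scaling, both columns contribute on the same footing.
print(Xs.std(axis=0))  # each column now has unit standard deviation
```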
Now that you’ve really studied your data, are you going to take any preprocessing measures? For which columns, and why? Define any measures you’ve taken and address why you chose to do so.
Yes, I will take preprocessing measures for the Channel, Region, and Total columns. They are not helpful when we model the data, so the simplest approach is to drop them.
myData.drop(labels='Channel', axis=1, inplace=True)
myData.drop(labels='Region', axis=1, inplace=True)
myData.drop(labels='Total', axis=1, inplace=True)
scaler = StandardScaler()
scaled_df = scaler.fit_transform(myData)
pd.DataFrame(scaled_df).describe()
| | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| count | 4.400000e+02 | 4.400000e+02 | 4.400000e+02 | 4.400000e+02 | 4.400000e+02 | 4.400000e+02 |
| mean | -2.422305e-17 | -1.589638e-17 | -6.030530e-17 | 1.135455e-17 | -1.917658e-17 | -8.276208e-17 |
| std | 1.001138e+00 | 1.001138e+00 | 1.001138e+00 | 1.001138e+00 | 1.001138e+00 | 1.001138e+00 |
| min | -9.496831e-01 | -7.787951e-01 | -8.373344e-01 | -6.283430e-01 | -6.044165e-01 | -5.402644e-01 |
| 25% | -7.023339e-01 | -5.783063e-01 | -6.108364e-01 | -4.804306e-01 | -5.511349e-01 | -3.964005e-01 |
| 50% | -2.767602e-01 | -2.942580e-01 | -3.366684e-01 | -3.188045e-01 | -4.336004e-01 | -1.985766e-01 |
| 75% | 3.905226e-01 | 1.890921e-01 | 2.849105e-01 | 9.946441e-02 | 2.184822e-01 | 1.048598e-01 |
| max | 7.927738e+00 | 9.183650e+00 | 8.936528e+00 | 1.191900e+01 | 7.967672e+00 | 1.647845e+01 |
# precompute_distances, n_jobs and algorithm='auto' were removed or renamed
# in recent scikit-learn, so only the still-supported arguments are passed
model = KMeans(n_clusters=3,
               init='k-means++',
               n_init=10,
               max_iter=300,
               tol=0.0001,
               random_state=42)
model.fit(scaled_df)
model.inertia_
1607.6748434206982
clusters = range(1, 20)
sse=[]
for cluster in clusters:
    model = KMeans(n_clusters=cluster,
                   init='k-means++',
                   n_init=10,
                   max_iter=300,
                   tol=0.0001,
                   random_state=42)
    model.fit(scaled_df)
    sse.append(model.inertia_)
sse_df = pd.DataFrame(np.column_stack((clusters, sse)), columns=['cluster', 'SSE'])
fig, ax = plt.subplots(figsize=(13, 5))
ax.plot(sse_df['cluster'], sse_df['SSE'], marker='o')
ax.set_xlabel('Number of clusters')
ax.set_ylabel('Inertia or SSE')
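Reading the elbow off the plot is subjective; one simple programmatic heuristic is to pick the k where the drop in SSE slows the most (the largest second difference). A sketch on a synthetic SSE curve standing in for the `sse` list computed above:

```python
import numpy as np

# Synthetic SSE-vs-k curve with an elbow built in at k = 5.
sse = [1600.0, 1400.0, 1250.0, 1150.0, 1059.0, 1030.0, 1010.0, 995.0]
ks = list(range(1, len(sse) + 1))

second_diff = np.diff(sse, n=2)              # discrete curvature of the curve
elbow = ks[int(np.argmax(second_diff)) + 1]  # +1 offset for the double diff
print("suggested k:", elbow)
```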
model = KMeans(n_clusters=5,
               init='k-means++',
               n_init=10,
               max_iter=300,
               tol=0.0001,
               random_state=42)
model.fit(scaled_df)
KMeans(n_clusters=5, random_state=42)
print('SSE: ', model.inertia_)
print('\nCentroids: \n', model.cluster_centers_)
pred = model.predict(scaled_df)
myData['cluster'] = pred
print('\nCount in each cluster: \n', myData['cluster'].value_counts())
SSE:  1058.77125325701

Centroids:
 [[ 1.65897027e+00 -1.08371983e-01 -2.17703067e-01  1.10347289e+00 -4.04601989e-01  3.33024950e-01]
 [-4.94431759e-01  6.87784611e-01  9.11873238e-01 -3.31564429e-01  9.07389458e-01  1.02422883e-01]
 [ 1.96681731e+00  5.17550306e+00  1.28721685e+00  6.90059988e+00 -5.54861977e-01  1.64784475e+01]
 [-2.30202959e-01 -3.83683148e-01 -4.36547623e-01 -1.65012833e-01 -3.97208366e-01 -1.93797294e-01]
 [ 3.13830315e-01  3.92190593e+00  4.27561037e+00 -3.57419457e-03  4.61816580e+00  5.03365339e-01]]

Count in each cluster:
 3    270
1     96
0     63
4     10
2      1
Name: cluster, dtype: int64
We can see that cluster 3 has the most samples (270), while cluster 2 contains only a single observation, which suggests it is capturing an extreme outlier.
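The centroids printed above are in standardized units; `scaler.inverse_transform` maps them back to raw spending for easier interpretation. A sketch with synthetic two-column data standing in for `scaled_df` (column means loosely echo the Fresh and Milk averages):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for two spending columns of the dataset.
rng = np.random.default_rng(42)
X = rng.normal(loc=[12000, 5800], scale=[3000, 1500], size=(100, 2))

scaler = StandardScaler()
Xs = scaler.fit_transform(X)
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(Xs)

# Map the standardized centroids back into the original spending units.
raw_centroids = scaler.inverse_transform(km.cluster_centers_)
print(raw_centroids)
```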
from sklearn.preprocessing import StandardScaler, normalize
scaler = StandardScaler()
# drop the cluster labels added above so they do not leak into the new scaling
scaled_df = scaler.fit_transform(myData.drop(columns='cluster'))
normalized_df = normalize(scaled_df)
normalized_df = pd.DataFrame(data=normalized_df)
from sklearn.decomposition import PCA
pca = PCA(n_components = 2)
X_principal = pca.fit_transform(normalized_df)
X_principal = pd.DataFrame(X_principal)
X_principal.columns = ['P1', 'P2']
g=X_principal.mean()
g.head()
P1    2.800790e-17
P2    1.930274e-17
dtype: float64
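Besides the component means, it is worth checking how much variance the two principal components actually retain before clustering in PCA space. A sketch with synthetic correlated data standing in for `normalized_df` (the real check is `pca.explained_variance_ratio_` on the fitted object above):

```python
import numpy as np
from sklearn.decomposition import PCA

# Four columns sharing one latent factor plus small noise, so the first
# component should capture almost all of the variance.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
X = np.hstack([base + 0.1 * rng.normal(size=(200, 1)) for _ in range(4)])

pca = PCA(n_components=2).fit(X)
print(pca.explained_variance_ratio_)
```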
import scipy.cluster.hierarchy as shc
plt.figure(figsize=(10, 5))
plt.title('Dendrogram of the data')
Dendrogram = shc.dendrogram(shc.linkage(X_principal, method='ward'))
plt.show()
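A dendrogram only visualizes the merge hierarchy; to obtain flat cluster labels from the same linkage, `scipy` provides `fcluster`. A minimal sketch on a tiny synthetic 2-D set with two obvious groups, standing in for `X_principal`:

```python
import numpy as np
import scipy.cluster.hierarchy as shc

# Two tight groups of three points each.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])

Z = shc.linkage(X, method="ward")
labels = shc.fcluster(Z, t=2, criterion="maxclust")  # cut into two clusters
print(labels)
```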
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score
# where we'll save scores for later plotting
silhouette_scores = []
# testing different cluster values in range [2,8)
for n_cluster in range(2, 8):
    silhouette_scores.append(
        silhouette_score(X_principal,
                         AgglomerativeClustering(n_clusters=n_cluster).fit_predict(X_principal)))
# Creating bar graph to compare the results. You can use a line plot if you prefer (similar to K Means lab)
plt.bar(x=range(2, 8), height=silhouette_scores)
plt.xlabel('Number of clusters')
plt.ylabel('Silhouette Score')
plt.show()
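Rather than eyeballing the bar chart, the k with the best silhouette can be selected programmatically. A sketch on a synthetic three-blob set standing in for `X_principal` (on such data the best k is the true blob count):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

# Three well-separated synthetic blobs.
rng = np.random.default_rng(1)
blobs = [rng.normal(loc=c, scale=0.2, size=(30, 2)) for c in ([0, 0], [4, 0], [0, 4])]
X = np.vstack(blobs)

scores = {k: silhouette_score(X, AgglomerativeClustering(n_clusters=k).fit_predict(X))
          for k in range(2, 8)}
best_k = max(scores, key=scores.get)  # k with the highest silhouette
print("best k:", best_k)
```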
# creating and fitting model
agg = AgglomerativeClustering(n_clusters=3)
agg.fit(X_principal)
AgglomerativeClustering(n_clusters=3)
# Visualizing the clustering
plt.scatter(X_principal['P1'], X_principal['P2'],
c = AgglomerativeClustering(n_clusters = 3).fit_predict(X_principal))
plt.show()
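Beyond visual inspection, the two clusterings can be compared quantitatively by label agreement, e.g. the adjusted Rand index (1.0 means identical partitions). A sketch on synthetic blobs standing in for the project data:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.metrics import adjusted_rand_score

# Three well-separated synthetic blobs that both methods should recover.
rng = np.random.default_rng(7)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(40, 2))
               for c in ([0, 0], [5, 0], [0, 5])])

km_labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)
agg_labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)

# ARI is invariant to label permutations, so it compares the partitions fairly.
print(adjusted_rand_score(km_labels, agg_labels))
```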
By now, you should have completed any necessary scaling, encoding, and preprocessing measures. The next steps are creating and training your model. Split your data into training and testing sets. Explain which 2 models you’ve chosen and why. Explain and define each model, giving background info, its uses, why it’s beneficial, why you chose it over another model, and compare both models you’ve chosen. Is your approach parametric or non-parametric? Which features were most important? How did both models perform? Make sure you give all relevant info. Show the confusion matrix and classification report. Write a conclusion wrapping up your findings.
Before starting this project, I did some research to find the best clustering machine-learning algorithms. There are more than 100 clustering algorithms; one website listed these as the top 7:
1. Agglomerative Hierarchical Clustering
2. Balanced Iterative Reducing & Clustering
3. EM Clustering
4. Hierarchical Clustering
5. Density-Based Spatial Clustering
6. K-Means Clustering
7. Ordering Points To Identify the Structure of Clustering
From the algorithms above, I chose Agglomerative Hierarchical Clustering and K-Means Clustering. I chose them because we covered both in a lab experiment and the professor also explained in lecture how they work; in other words, these were the two algorithms most familiar to me. Features do have to be scaled before modeling for clustering, but clustering algorithms do not require train_test_split because they are unsupervised learning algorithms. Both algorithms are non-parametric. The least useful features were "Channel" and "Region." K-Means with five clusters gave an SSE of 1058.77, while Agglomerative Clustering, as the diagram above shows, divided the data nicely into three groups.